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Abstract 

As computing technology continues to advance, computational modeling of scientific and engi- 
neering problems produces data of increasing complexity: large in size and unstructured in shape. 
Volume visualization of such data is a challenging problem. This paper proposes a distributed 
parallel solution that makes ray-casting volume rendering of unstructured-grid data practical. 
Both the data and the rendering process are distributed among processors. At each processor, 
ray-casting of local data is performed independent of the other processors. The global image 
compositing processes, which require inter-processor communication, are overlapped with the 
local ray-casting processes to achieve maximum parallel efficiency. This algorithm differs from 
previous ones in four ways: it is completely distributed, less view-dependent, reasonably scalable, 
and flexible. Without using dynamic load balancing, test results on the Intel Paragon using from 
two to 128 processors show, on average, about 60% parallel efficiency. 


iThis research was supported in part by the National Aeronautics and Space Administration under NASA 
contract NAS1-19480 while the author was in residence at the Institute for Computer Applications in Science and 
Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681-0001. 


l 



1 Introduction 


Computational modeling of scientific and engineering problems with complex geometries often 
uses finite volume methods or finite element approximations, and thus calculations are carried 
out on unstructured grids. Typically, the problem domain is decomposed into small cells, called 
elements. Popular element types include the tetrahedron, triangular prism (pentahedron) and 
hexahedron. Many visualization techniques have been developed for the interrogation and anal- 
ysis of unstructured-grid data. While exterior face rendering and cutting plane methods remain 
the most common and affordable techniques, three-dimensional methods such as direct volume 
rendering have received considerable attention because they can capture the overall data domain 
in a single image, and are capable of revealing complex features in the data that traditional 
three-dimensional graphics techniques fail to represent. 

For most scientific and engineering problems, large-scale simulations can generate data with 
hundreds of thousands of elements or more. The absence of a simple indexing scheme for three- 
dimensional unstructured grids makes direct volume rendering a computationally expensive pro- 
cess. Since parallel processing enables the solution of many other compute-intensive problems, 
computer graphics and visualization researchers have also been exploiting various parallel meth- 
ods for volume rendering. 

In this paper, we describe a distributed parallel volume ray-casting algorithm for visualizing 
unstructured-grid data. This algorithm differs from previous ones in several ways: it is completely 
distributed, less view-dependent, reasonably scalable, and flexible. First, both the data and the 
rendering computation are distributed across the available processing nodes. Inter-processor 
communication is only needed for the image compositing step. At each processor, ray-casting 
of local data is performed independent of other processors. Image compositing is overlapped 
with the ray-casting processes to achieve higher parallel efficiency. Second, the overhead due to 
view changes is kept to a minimum since a good distributed rendering algorithm must cope with 
frequent view changes to support truly interactive data exploration. Third, while using more 
processing nodes increases the number of image compositing layers, the image area that each 
processor must handle decreases; as a result, the algorithm is scalable. Without dynamic load 
balancing, we have achieved, on average, 60% parallel efficiency on the Intel Paragon using from 
two to 128 processors. Complete test results and some performance studies are presented at the 
end of the paper. 

Finally, although the prototype implementation handles only tetrahedral cells, the algorithm 
can be generalized to handle a mix of cell types and arbitrary object geometry, such as objects 
with holes and concavities. Note that while this paper focuses on the design and implementation 
of a parallel Tenderer as a postprocess, another equally important goal is to support runtime 
monitoring of large-scale parallel simulations, which may generate data that is too large to move 
elsewhere for postprocessing. This algorithm is flexible enough to support this goal as well. 
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2 Visualization on Unstructured Grids 


Data visualization on unstructured grids requires multiple steps to obtain maximum efficiency due 
to the irregularity of the grids. Yarmarkovich and Gelberg [23] describe preprocessing methods to 
achieve interactive visualization for techniques like exterior-face rendering, slicing and iso-surface 
rendering. Gallagher and Nagtegaal [4] present an algorithm based on the marching cubes iso- 
surface extraction method [10]. However, unlike the basic marching cubes algorithm, the surface 
patches derived are represented by parametric bi-cubic polynomials to achieve visually smooth 
appearance. These surface patches are then subdivided into planar polygons for rendering using 
hardware Gouraud shading. Koyamada and Nishio [9] describe two approaches for extracting 
iso-surfaces from tetrahedral elements. One is based on the marching cubes algorithm and the 
other is a ray-tracing method. In the ray-tracing approach, a diffusive iso-surface is made by 
raising the opacity in the region containing the iso- value. 

For direct volume rendering, there are generally two approaches: ray-tracing and projection. 
Garrity [5] uses a ray-tracing approach for rendering tetrahedra. Elements of other types are 
handled by first subdividing them into tetrahedra. For each ray, exterior faces are tested to find 
the first intersection point, and subsequent intersection points can be efficiently calculated by 
using the connectivity between elements. Giertsen [6] proposes a different ray-tracing approach 
by slicing the data along each horizontal scanline. The intersection of the slicing plane and the 
data elements results in a set of planar polygons, which are triangulated. Each resulting triangle 
is broken into segments and pixel-aligned segments are composited to form an image. 

In the projection approach, elements are first sorted in visibility order. Then each element 
is projected onto the screen in either front-to-back or back-to-front order. The element’s con- 
tribution to each pixel is calculated and blended with existing values. Williams [21] develops 
an algorithm for determining the visibility order of elements. Max, Hanrahan and Crawfis [13] 
describe an accurate, but computationally expensive, analytic illumination model for use with 
their projection algorithm. Shirley and Tuchman [16] propose a splatting algorithm which pro- 
vides a fast approximation to the projection process. Williams [20] further approximates Shirley 
and Tuchman’s splatting algorithm to achieve better interactive rendering rates. These fast ap- 
proximation methods are suggested for use in data previewing, rather than producing realistic 
or accurate visualization. 

On the other hand, the development of massively parallel rendering algorithms for irregular 
data has been rather sparse. Notably, Williams [22] developed a cell-projection volume rendering 
algorithm for finite element data running on a single SGI multiprocessor workstation. Uselton 
[19] implemented a volume ray-tracing algorithm for curvilinear grids on a similar platform. 
The tests were performed for up to eight processors and high parallel efficiency was obtained. 
Challinger [2] developed a parallel volume ray-tracing algorithm for nonrectilinear grids and 
implemented it on the BBN TC2000, a multiprocessor architecture with up to 128 nodes. Note 
that all three of these used shared-memory parallel computers. Finally, Giertsen and Petersen 
[7] designed a scanline volume rendering algorithm based on Giertsen’s previous work [6] and 
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Figure 1: A Distributed Parallel Rendering Process. 


implemented it on a network of workstations. Data are replicated on each workstation, and a 
master-slave scheme is used to achieve dynamic load balance. However, tests were performed 
with a maximum of four workstations, so the scalability of the algorithm and the implementation 
for massively parallel processing has yet to be demonstrated. 

3 A Parallel Rendering Algorithm 

We have developed a flexible and efficient distributed parallel volume ray-casting algorithm. 
Figure 1 depicts the parallel rendering process which consists of multiple steps. In this paper, 
we only consider rendering as a postprocess. For runtime visualization, rendering data in place 
on the same computer where the parallel simulation runs involves additional considerations; for 
example, data partitioning should comply with and be done by the simulation. As described in 
[11], these considerations do not change the basic algorithm and are not discussed here. Handling 
grids that change dynamically at runtime would be a future research topic. 
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3.1 Data Partitioning 

Partitioning of the computational domain and its associated data structures is also an important 
problem for computational researchers implementing large-scale unstructured grid calculations 
on distributed-memory computers. Most of the partitioning algorithms developed are recursive 
and based on graph bisection [8]. In essence, the computational domain, represented as an 
undirected graph, is subdivided into two subdomains based on some criterion; then the same 
criterion is applied to the subdomains recursively. A good partitioning for parallel simulation 
results in subdomains of about equal size and minimizes boundary area between subdomains. 
While the first property helps load balancing, the second reduces inter-processor communication. 

To take full advantage of a distributed-memory parallel computer like the Intel Paragon, the 
first step is to partition the data volume into p subvolumes, where p is the number of processors 
used. A perfect subdivision should produce subvolumes with identical memory and processing 
requirements to achieve good load balancing. Nevertheless, the requirements of visualization 
calculations are generally very different from the simulation’s. The criterion used by the parallel 
simulation may not be applicable. We have used a graph-based data-partitioning software pack- 
age to obtain reasonably even subvolumes, i.e. subvolumes containing about the same number 
of elements. 

3.2 Data Preprocessing 

An unstructured-grid data set is composed of at least a set of nodes and a list of elements that 
are constructed from the nodes. At each node, its coordinates [x, y, z] and some function values 
like density or pressure are stored. In order to efficiently execute the visualization processes, we 
need to know more than the existing element-node relationship. A data structure which we call 
the hierarchical data structure (hds) is created during a preprocessing step. An hds consists of 
three layers of data structures. A node set is a data structure at the lowest layer, from which 
a face set is constructed. The top layer of the hds is an element set constructed from the face 
set. Therefore, an hds records the face-node, element-face, and element-node relationships, which 
allow fast information retrieval in the expense of additional memory space. 

The subsequent ray-casting operations always begin at the boundary of the unstructured 
domain, which is formed by the exterior faces. An exterior face is a face that is not shared by 
elements. As a result of data partitioning, there are two types of exterior faces: globally and 
locally. Globally exterior faces represent the boundary of the whole volume. Locally exterior 
faces form the boundary of each subvolume. After data partitioning, while a globally exterior 
face is always a locally exterior face, a globally interior face might become a locally exterior one. 
Since our algorithm need not distinguish the globally exterior faces from the local ones, for the 
rest of the discussion, when we mention an exterior face, we mean a locally exterior face. 

According to the current viewing position, an exterior face can be either a visible or an invisible 
one. A ray enters a subvolume at a visible exterior face and exits at an invisible exterior one, 
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Figure 2: Face Types in a Subvolume. 

as shown in Figure 2. The visible exterior faces are orthographically projected onto the screen. 
This projection produces a set of convex polygons in screen coordinates, which define the exact 
screen area for casting rays. So it is necessary to distinguish the visible exterior faces from the 
invisible ones. 

It is clear that we can separate the view-dependent preprocessing from the view-independent 
preprocessing calculations. Therefore, after data are distributed to processors, each processor 
performs two preprocessing steps independent of other processors. The first step, the view- 
independent one, finds connectivity between elements, identifies all exterior faces, calculates face 
normals, etc. This step needs to be done only once. The view-dependent step then follows to 
extract all visible faces from the set of exterior faces; whenever the viewing position is changed, 
the exterior face list is searched to find a new set of faces visible from the current viewing position. 

3.3 Ray-Casting 

The local rendering process traverses the visible-face list and performs ray-casting in scanline 
order from face to face. A ray is cast from the eye, enters the data domain through an exterior 
face and marches element by element until it exits from the domain through another exterior face. 
The irregular shapes common to unstructured data often result in concavity and holes in the 
body of the data volume. Especially for distributed computing, the original grid is partitioned 
into subgrids that often have saw-toothed boundaries as shown in Figure 3 and the left image 
of Color Plate 1. Thus a ray may enter and exit multiple exterior faces. With the connectivity 
information, the casting of a ray is straightforward except for some degenerate elements [5]. 

The intensity of the ray is obtained by accumulating the intensity values contributed by all 
elements visited. For each element, the equation for intensity /(a, b ) of the corresponding ray 
segment is given by 

I (a, b) = f b e~Sl^ ds I(t)dt 

Ja 

where a and b define where the ray enters and leaves the element, c is the attenuation coefficient, 
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and I(t ) is the intensity at a point t along the ray. In [13], analytic formulas are derived 
to compute the intensity accumulated in a ray segment. In general the color and the opacity 
mapping of function values cannot be defined by an analytic function, so it is difficult, sometimes 
impossible, to derive a closed form solution for computing the intensity values. For a linear 
element, a good approximation, according to the Gaussian Quadrature [18], is to compute the 
intensity value at the middle of the segment and to use that value for the segment. Then two 
points are sampled for a quadratic element and three points are sampled for a cubic element. 
The Gaussian Quadrature integration approximates a polynomial of degree 2n — 1 by using n 
points. In this way, the order of precision is (2 n — 1) which is acceptable for our purpose. 

In practice, because the color and opacity transfer functions used are usually not polynomials, 
even for a linear element, it might be necessary to sample at multiple points along the ray segment 
within the element. That is, sampling along a ray should be selected according to not only the 
data resolution and the variation of data values, but also the variation of the transfer function 
values. 

To implement adaptive sampling, considering an element with a linear interpolation function 
f( x iy, z )i if a ray P(t) enters the element at t = a and exits at t = 6, the sample rate can be 
defined as 

r = mm) - m a))) 

b — a 

where P(t) = P( 0) + fd, P( 0) is the starting point of the ray, d is the direction of the ray, and k 
is a constant which is the sampling rate for a unit change of function value. However, ray-casting 
on unstructured grids is already a very expensive operation. Adaptive sampling makes it even 
more expensive and could significantly increase the overall rendering time. 

Consequently, in our current implementation, we used a much simpler approach by precom- 
puting a dt value for sampling along a ray at a fixed rate. A good principle for sampling linear 
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elements is that, besides the entering and leaving points a and b, at least one additional point c is 
sampled for each element. Then between each pair of sample points, a reasonable approximation 
is to use the Trapezoid rule [18] to calculate the value of the corresponding interval. 

To estimate dt , we use 


l x ly 


dt = 0.5xmm(— — , 




) 


where l x , l y , and l z are the size of the bounding box (in x, y and z direction, respectively) contain- 
ing the domain, and n e is the total number of elements in the domain. Using such estimation, 
more samples would be collected than needed within a large element of constant value. The 
advantages of using a fix sampling rate is mainly simpler implementation. We applied the above 
formula to data sets of significantly different grid characteristics and found that it generates 
reasonably good sampling interval values. 

At each sample point on the ray, the interpolated function value is used to obtain a color and an 
opacity from a look-up table. The intensity value at that point is then computed using these color 
and opacity values. The final image value corresponding to each ray is formed by compositing, 
front-to-back, the intensity values of the sample points along the ray. The compositing operation 
is associative [15], which allows us to break a ray up into segments, process the sampling and 
compositing of each segment independently, and combine the results from each segment via a 
final compositing step. This is the basis for our parallel volume rendering algorithm. 


3.4 Image Compositing 

In parallel rendering of regular data, image compositing order can always be determined a priori. 
Many efficient parallel image compositing algorithms for the rendering of regular data have been 
proposed [1, 12, 14]. However, as indicated previously, an unstructured domain tends to be 
irregular in shape and may contain holes and concavities. The situation could be more severe 
for subdomains after data partitioning. Consequently, the ray segments which contribute to an 
individual pixel cannot be combined until they are all received and properly sorted in visibility 
order by the responsible processor. 

There are essentially two approaches for delivering ray segments to the responsible processors. 
While the first approach is to separate the local rendering completely from the global compositing, 
the second one is to overlap them. That is, in the first approach, the delivery and compositing 
of ray segments does not occur until the local ray-tracing process is completely finished. At the 
end of the rendering, ray segments are then divided into groups and each group is sent to the 
responsible processor. In the second approach, a ray segment can be sent immediately after it is 
generated. Therefore, the first approach would result in large messages being sent all about the 
same time, likely to induce network congestion [3]. 

On the other hand, with the second approach, smaller messages are sent throughout the entire 
course of rendering. To reduce the number of messages, groups of ray segments are sent imme- 
diately following the rendering of one locally visible face. Ray segments are divided into groups 
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Figure 4: Both the shape of the data and the data partitioning affect a ray buffer. 

according to their corresponding destinations. Then, ray segments received at each processor 
are sorted into a local ray buffer by destination pixel. Sorted ray segments are composited in 
order at the end of rendering. The final subimage generated by each processor is sent back to 
the host computer. According to our tests and others [14, 17], overlapping the ray casting and 
compositing can improve the overall image compositing performance by as much as 40%. 

There are different ways of partitioning image space [14]. As shown later, compared to the 
ray-tracing cost, the image compositing cost is relatively low with the kind of preprocessing that 
has been done. In our current implementation, for simplicity, the image space is partitioned 
evenly into horizontal strips. More elaborate image partitioning will be considered later when 
attaining interactive rendering rates. 

3.4.1 Ray Buffer 

The efficiency of the compositing process is affected by the number of ray segments stored in the 
ray buffer. A ray buffer is a two dimensional array of linked lists. For performing distributed 
image compositing, each processor is assigned a portion of the final image. A corresponding local 
ray buffer is created for storing incoming ray segments. Each incoming ray segment is sorted into 
the linked list pointed to by the corresponding pixel address. Prior to the start of rendering, each 
processor allocates an empty ray buffer according to the image decomposition scheme. During 
the course of the rendering, the size of the ray buffer grows. The final size depends on the viewing 
position, the shape of the data volume and how the data partitioning has been done. An ideal 
data partitioning would produce compact subvolumes that are simple in shape, resulting in ray 
buffers of minimum size. As shown in Figure 4, tracing of Ray 2 would produce many local ray 
segments which would then result in a much longer linked list than Ray 1 and 3 because of the 
view direction as well as the partitioning. 
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4 Test Results 


We have tested an implementation of the rendering algorithm on the Intel Paragon XP/S by 
using an artificial data set and a flow data set from a computational fluid dynamics simulation. 
Message passing was implemented by using the native communication library, NX. A host 
program runs on a service node of the Paragon for delivering data and collecting results. In 
this study, we ignore I/O problems and focus on the rendering algorithm. All timing results are 
given in seconds and are calculated by averaging the times obtained on each node from rendering 
an animation sequence of ten frames covering various viewing directions, and then selecting the 
maximum averaged value among all participating nodes. 

The artificial volume data set has some well understood properties in both data values and 
grid topology. The data domain is composed of tetrahedra of identical size. The overall domain 
is cubical in shape and the layout of elements is symmetrical. Thus we call it the cube data set. 
The highest intensity value is assigned to the center of the domain with decreasing values toward 
the boundaries of the domain. The overall volume contains 150 thousand tetrahedra. In Color 
Plate 1, the right image shows a volume-rendered image of such a data set. Unlike exterior-face 
rendering, direct volume rendering allows us to see through the volume and visualize properties 
inside the volume by assigning low opacity to low intensity data values. The left image of 
Color Plate 1 shows exterior-face rendering of a particular subvolume. We plot grid lines on the 
subvolume boundary to show the grid structure and the resolution of the data. 

The flow data set is typical in numerical modelings that use unstructured grids. It is the result 
of simulating flow over a vehicle forebody at a Mach number of 8.15 and an angle of attack of 
30 degrees. The flow field contains a detached bow shock following the shape of the forebody. 
There are about 45.5 thousand tetrahedra. The left image of Plate 2 shows an exterior-face 
rendering of this volume data. The grid lines on the exterior faces are also plotted to show 
the the highly adaptive grid structure. Since only the region surrounding the vehicle body is 
interesting, on the right side of Color Plate 2, a cropped image shows a volume rendering of that 
region. Colors are mapped to the density field such that red and yellow highlight higher density 
regions. More visualization results are displayed in Color Plate 3 and 4. Plate 3 shows direct 
volume visualization of the pressure field and Plate 4 shows Mach number. 

Data partitioning has been mainly done using Chaco [8], a software package developed by Brue 
Hendrickson and Robert Leland at Sandia National Laboratories to partition graphs. Chaco 
provides several different methods as there is no single method that is good for all types of 
data. For the cube data set, since all tetrahedra are identical in size, the partitioning results 
in subvolumes of about the same size not only in terms of the number of elements but also in 
terms of the region of space each occupies. A typical subvolume is shown in the right image of 
Color Plate 1 as the result of partitioning for eight processors. In this regard, one would expect 
each processing node to take about the same time to finish rendering. We compare the processor 
taking the maximum rendering time with the one taking minimum time. The difference between 
the two can be as high as about 40%. This difference is partly due to the early ray-termination 
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nodes 

2 

4 

8 

16 

32 

64 

128 

view indep prep 

110.8 

5.146 

2.657 

1.305 

0.731 

0.588 

0.172 

view dep prep 

60.34 

23.47 

13.21 

6.011 

3.703 

2.221 

1.508 

ray casting 

2234 

1268 

769.3 

443.2 

327.1 

177.9 

109.3 

ray-seg deliver 

4.608 

2.503 

1.657 

1.036 

0.89 

0.572 

0.548 

ray-seg sort 

62.45 

19.07 

3.725 

2.713 

2.307 

1.656 

1.303 

ray-seg merge 

53.99 

9.085 

1.139 

0.736 

0.52 

0.279 

0.146 


Table 1: Time Breakdown for Rendering the Cube Data, 480x480 Image Size. 

scheme we use. That is, a ray is terminated when the accumulated opacity value reaches unity. 
For the cube data, higher density values are mapped to higher opacity values. If a subvolume 
has a larger projected area of high density region, many rays would terminate earlier and thus 
traverse through fewer elements. 

On the other hand, for the flow data set, because the grid used for the calculations was 
generated in an adaptive manner, a large number of smaller elements occupy a relatively small 
region of space in the overall domain. Consequently, after partitioning, although the subvolumes 
are about the same size in terms of the number of elements, their physical sizes are very different. 
In this case, one would expect the smaller volume in space to project onto a small area of the 
screen and that the corresponding rendering would take far less time than the larger volume. 
Based on our test results, the difference between the maximum and the minimum time is also 
about 40%, much lower than we expected. Although many fewer rays are cast for the smaller 
projected area, the number of elements a ray must traverse is high, compared to a subvolume 
with a large projected area but small number of elements. 

Figure 5 and Table 1 show timing results for rendering the cube data onto a 480x480 image, 
using from two to 128 processors. In Figure 5, only the ray tracing time is plotted because the 
ray tracing time is completely dominant. Note that a logarithmic scale is used for both the x 
and y axes so that both execution time and speedup information can be presented in one plot. 
In all the Tables, the item names are abbreviated as follows: 
view indep prep - view independent data preprocessing time 
view dep prep - view dependent data preprocessing time 
ray casting - ray casting time 

ray-seg deliver - ray segment delivery time 

ray-seg sort - ray segment sorting time 

ray-seg merge - ray segment compositing time 

As we mentioned earlier, the image compositing time (ray segment delivery + sorting + com- 
positing ) is negligible and normally decreases as more processors are used. The time for the 
view-dependent data preprocessing is also moderate and decreases accordingly. In Table 2, tim- 
ing results using 64 processors are presented for five different image sizes. Since only the image 
compositing process requires node-to-node communication, from these tests, we would like to 
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Figure 5: Ray Casting Time for the Cube Data 


image size 

240 

360 

480 

720 

view indep prep 

0.501 

0.671 

0.588 

1.049 

view dep prep 

2.587 

2.82 

2.221 

2.995 

ray casting 

49.445 

127.78 

177.92 

480.6 

ray-seg deliver 

0.669 

0.646 

0.572 

0.861 

ray-seg sort 

1.152 

1.405 

1.656 

2.146 

ray-seg merge 

0.078 

0.173 

0.279 

0.232 


Table 2: Time Breakdown for Rendering the Cube Data, Different Image Sizes, 64 Processors. 

determine bow the overall image compositing cost would change in proportion to an increase in 
the image size. As the numbers indicate, the image compositing cost grows more slowly than 
the ray casting cost. 

Figure 6 and Table 3 show timing results for rendering the flow data onto a 480x480 image, 
using from two to 128 processors. Again, only the ray tracing time is plotted. We see a similar 
trend in these timing results. According to our timing results, the rendering cost for unstructured 
data is very high. Although we believe that we can tune our present code to achieve about 30% 
improvement in overall rendering time, near real-time rendering rates are still difficult to attain for 
large data sets. Even with faster processors like the IBM SP2, the best we can probably achieve 
is no better than one frame per second when using 256 nodes or more. Further improvement in 
overall rendering performance may be obtained by balancing the load dynamically. 
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Figure 6: Ray Casting Time for the Flow Data. 


number of nodes 

2 

4 

8 

16 

32 

64 

128 

view indep prep 

3.35 

1.659 

0.910 

0.42 

0.23 

0.11 

0.062 

view dep prep 

1.53 

0.645 

2.371 

0.919 

0.861 

0.879 

0.839 

ray casting 

1585 

677.6 

445.3 

245.5 

187.2 

86.07 

63.2 


3.02 

1.649 

0.988 

0.662 

0.502 

0.360 

0.316 

ray-seg sort 

2.749 

3.925 

3.856 

2.549 

1.755 

1.195 

0.972 

ray-seg merge 

0.841 

0.817 

0.621 

0.369 

0.224 

0.129 

0.068 


Table 3: Time Breakdown for Rendering the Flow Data, 480x480 Image Size. 
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Figure 7: Load Imbalance in Rendering the Flow Data. 

4.1 Load Imbalance 

Our timing results show that, on average, the imbalanced loads degrade the overall performance 
by at least 20%. Figure 7 shows the proportion of the total rendering time for the flow data 
(480x480 pixels) due to load imbalance, which is calculated as: 

1 tgvg 

tmax 

where t avg is the average rendering time and t max is the maximum rendering time. Note that 
this formulation to characterize load imbalance does not reveal the distribution of imbalanced 
load. As a further test, we rendered the flow data using a zoom-in view. Consequently, the 
load imbalance became worse. For the flow data and image size of 480x480 pixels, using 32 
processors and focusing on views similar to the one presented in the left image of Color Plate 2, 
the difference between the maximum and the minimum time was raised from about 40% to 66%. 
The proportion of time devoted to ray casting due to load imbalance then became about 33%. 

In summary, there are five factors which contribute to the load imbalance shown in our timing 
results: 

• subvolume size: number of elements 

• subvolume size: physical size (projected area) 

• opacity transfer functions 

• viewing angle 

• zoom in/out 
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Therefore, it will be difficult to achieve load balancing statically. In fact, we had tried a static 
scheme which works as follows. For p processors, a data volume is partitioned into n-p subvolumes 
where n is a magic number which can be determined after one or two test runs. Then the n-p 
subvolumes are distributed in a round-robin fashion among processors. So now each processor 
rendered possibly many disjointed subvolumes instead of a single large one. Consequently, the 
differences in both the average size of projected areas and in the average number of elements 
handled by each processor would become smaller as n increases. At the same time, many more 
ray segments are generated and result in higher image compositing cost. We were hoping that 
the more balanced load can greatly reduce the ray casting cost. But the timing results did not 
show consistent and significant improvement. Static load balancing is not a solution for the 
zoom-in problem which frequently arises from visualizing highly-adaptive unstructured grids. 
The development of dynamic load balancing methods for rendering unstructured-grid data will 
be our major emphasis of future research. 

5 Conclusions 

Direct volume rendering is a powerful visualization technique for many scientific and engineering 
applications. However, for large unstructured-grid data, volume rendering is too expensive to run 
on a single workstation. We have presented a data-distributed rendering algorithm for parallel 
volume ray-casting on unstructured grids. This algorithm makes possible rendering of large data 
sets that cannot fit into the main memory of a single workstation. 

We would like to point out that the two data sets we have used for testing are considered small. 
In practice, an unstructured-grid data set from computational fluid dynamics applications can 
contain several millions of elements. We have used smaller data sets for testing because they 
are more convenient and do not prevent us from revealing the performance of the rendering 
algorithms. 

Based on our timing results, the computational cost for postprocessing a data set with millions 
of elements would be tremendous, even using a massively parallel computer. Therefore, so 
far, the kind of postprocessing analyses that computational researchers can do using three- 
dimensional computer graphics techniques have been very limited. Our performance studies also 
indicate many opportunities for further optimization of the algorithm and its implementation. In 
particular, imbalanced load significantly affects the overall performance of the Tenderer. Criteria 
for data partitioning should be further studied to reduce the current load imbalance which is 
above 20%. Dynamic load balancing, though more effective, is harder to implement with a 
data-distributed approach. 

As we approach interactive rendering rates using more processors or more powerful processors, 
we may need to reevaluate both the image-space partitioning and image compositing step as the 
rendering step become less dominant. Data and image I/O, a frequently ignored problem, must 
be improved to make the overall rendering process more efficient. One important use of such 
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parallel rendering capability is to support runtime monitoring of numerical simulations running 
on a parallel computer. Therefore, our future work will mainly focus on supporting runtime 
visualization, along with the development of dynamic load balancing algorithms. 
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Plate 1: Visualization of the Cube Data. 



Plate 2: Visualization of the Flow Data. 
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Plate 3: Volume Visualization of the Pressure Field, 


Plate 4: Volume Visualization of the Mach Number. 
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